Purpose of the data set: This crime data set was supplied by the City of Los Angeles. It covers crimes reported from 2010 to 2019 and follows the FBI's Uniform Crime Reporting (UCR) standard, which summarizes incident information in a repeatable way that is comparable across cities, counties, and states. The LA City Mayor wrote that the data is published for the sake of transparency and to encourage those outside government to use it for innovation and problem solving. [Ref: https://data.lacity.org/]
Data Source: The data set used for the purposes of this project is sourced from LOS ANGELES OPEN DATA.
Data Importance: Analyzing crime data over several years helps identify high-risk populations and demographics so that corrective action can be taken to reduce crime and make the city safer.
Mining useful knowledge and measuring the effectiveness of prediction algorithms:
For this data set we would like to predict the following features:
import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import descartes
import plotly.express as px
import matplotlib.pylab as pl
#from plotly.offline import iplot_mpl
import seaborn as sns
import matplotlib as mpl
from matplotlib import cm
import itertools
#pip install geopandas
import geopandas as gpd
from shapely.geometry import Point, Polygon
%matplotlib inline
# Load the LA Crimes data set into pandas dataframe
df = pd.read_csv("Data/Crime_Data_from_2010_to_2019.csv")
# List the total number of rows and columns in the dataframe
print("Total number of rows in the dataframe: " + str(df.shape[0]))
print("Total number of columns in the dataframe: " + str(df.shape[1]))
# Displaying the data types of each column/attribute
df.info()
Based on the outputs above we have the following observations for the LA Crimes data set:
A detailed description of each attribute, along with its meaning and data type, is displayed in the table below.
[Ref: Description information in the table below is sourced from LOS ANGELES OPEN DATA.]
# Load Data Description file into pandas dataframe
data_desc = pd.read_csv('Data/Data_Description.csv')
from IPython.display import display, HTML
display(HTML(data_desc.to_html()))
Additional details about the data set can be found in the links below:
Link to Modus Operandi codes
Link to LAPD Reporting Districts
# Displaying the first 2 lines of the dataframe
df.head(2)
# Summary of attributes in the dataframe
df.describe().round(2)
# Display missing values in the Column "Vict Descent"
print("Missing values in the Column 'Vict Descent' are: " + str(df['Vict Descent'].isnull().sum()))
# Display number of victims grouped by their descent
# Descent Code:
# A: Other Asian, B: Black, C: Chinese, D: Cambodian, F: Filipino, G: Guamanian, H: Hispanic/Latin/Mexican,
# I: American Indian/Alaskan Native, J: Japanese, K: Korean, L: Laotian, O: Other, P: Pacific Islander,
# S: Samoan, U: Hawaiian, V: Vietnamese, W: White, X: Unknown, Z: Asian Indian
print()
print("Number of victims grouped by their descent are as below:")
df.groupby('Vict Descent').size()
Based on the outputs above we have the following observations for the 'Vict Descent' column:
Based on further analysis of a random selection of records, we conclude that these could have been data entry errors. Hence, we replace those records with Descent Code 'X', the Unknown category.
# Replace records in the 'Vict Descent' column having '-' with 'X'
df['Vict Descent'] = df['Vict Descent'].replace(to_replace='-',value='X')
# Replace records in the 'Vict Descent' column having blanks with 'X'
df['Vict Descent'].fillna('X', inplace = True)
# Display missing values in the Column "Vict Descent" and also display the number of victims grouped by their descent
print("Missing values in the Column 'Vict Descent' are: " + str(df['Vict Descent'].isnull().sum()))
print()
print("Number of victims grouped by their descent are as below:")
df.groupby('Vict Descent').size()
As seen in the output above, the Vict Descent category 'X' now has 274905 records. All the data entry errors are now corrected.
# Display missing values in the Column "Vict Sex"
print("Missing values in the Column 'Vict Sex' are: " + str(df['Vict Sex'].isnull().sum()))
# Display number of victims grouped by their Sex
# Sex Code:
# F: Female, M: Male, X: Unknown
print()
print("Number of victims grouped by their Sex are as below:")
df.groupby('Vict Sex').size()
Based on the outputs above we have the following observations for the 'Vict Sex' column:
Based on further analysis of a random selection of records, we conclude that these could have been data entry errors. Hence, we replace those records with Sex Code 'X', the Unknown category.
# Replace records in the 'Vict Sex' column having '-' with 'X'
df['Vict Sex'] = df['Vict Sex'].replace(to_replace='-',value='X')
# Replace records in the 'Vict Sex' column having 'H' with 'X'
df['Vict Sex'] = df['Vict Sex'].replace(to_replace='H',value='X')
# Replace records in the 'Vict Sex' column having 'N' with 'X'
df['Vict Sex'] = df['Vict Sex'].replace(to_replace='N',value='X')
# Replace records in the 'Vict Sex' column having blanks with 'X'
df['Vict Sex'].fillna('X', inplace = True)
# Display missing values in the Column "Vict Sex" and also display the number of victims grouped by their Sex
print("Missing values in the Column 'Vict Sex' are: " + str(df['Vict Sex'].isnull().sum()))
print()
print("Number of victims grouped by their Sex are as below:")
df.groupby('Vict Sex').size()
As seen in the output above, the Vict Sex category 'X' now has 251927 records. All the data entry errors are now corrected.
# Display missing values in the Column "Vict Age"
print("Missing values in the Column 'Vict Age' are: " + str(df['Vict Age'].isnull().sum()))
# Display number of victims grouped by their Age
print()
print("Number of victims grouped by their Age are as below:")
df.groupby('Vict Age').size()
# Display number of victims grouped by their Age, where Age value is less than or equal to zero
print()
print("Number of victims grouped by their Age, where Age value is less than or equal to zero, are as below:")
df[df['Vict Age']<=0].groupby('Vict Age').size()
# Display total number of records with invalid age values
# Age values zero and less are considered invalid age values
print()
print("Total number of records with invalid age values are: " + str( df[df['Vict Age']<=0]['Vict Age'].count() ))
# Display number of victims grouped by their Age, where Age value is greater than or equal to 100
print()
print("Number of victims grouped by their Age, where Age value is greater than or equal to 100, are as below:")
df[df['Vict Age']>=100].groupby('Vict Age').size()
# Display overall median victim age value
print()
print("Overall median victim age value in this data set is: " + str( df['Vict Age'].median() ))
# Display median victim age value of children (>0 & <18)
print()
print("Median victim age value of children (>0 & <18) in this data set is: " +
str( df[(df['Vict Age']>0) & (df['Vict Age']<18)]['Vict Age'].median() ))
# Display number of records grouped by 'Crime Code Description' categories that
# contain the word "CHILD" and with invalid age value less than or equal to zero
df[ (df['Crm Cd Desc'].str.contains("CHILD")) & (df['Vict Age']<=0) ].groupby('Crm Cd Desc').size()
# Display number of records grouped by 'Vict Age' that
# contain the word "CHILD" and with invalid age value less than or equal to zero
df[ (df['Crm Cd Desc'].str.contains("CHILD")) & (df['Vict Age']<=0) ].groupby('Vict Age').size()
# Display total number of records that contain the word "CHILD" in the Crime Code Desc and
# have an invalid age value less than or equal to zero
temp = df[ (df['Crm Cd Desc'].str.contains("CHILD")) & (df['Vict Age']<=0) ]['Vict Age'].count()
print("Total number of records that contain the word 'CHILD' in the Crime Code Desc and have an invalid age value less than or equal to zero are: " + str(temp))
Based on the outputs above we have the following observations for the 'Vict Age' column:
Based on further analysis of a random selection of records, we conclude that these invalid age values could have been data entry errors.
Hence, we perform the imputations below:
# Replace records that contain the word 'CHILD' in the Crime Code Desc and
# have an invalid age value less than or equal to zero, with
# the Median victim age value for children which is 14.
# As seen in the outputs above, for this particular category the invalid age values are [0, -1, -2, -3]
temp = df[ (df['Crm Cd Desc'].str.contains("CHILD")) & (df['Vict Age']<=0) ]['Vict Age']
temp = temp.replace(to_replace=[0, -1, -2, -3],value=14)
df.loc[ ( (df['Crm Cd Desc'].str.contains("CHILD")) & (df['Vict Age']<=0) ), 'Vict Age'] = temp
# Now replace the remaining records that have an invalid age value less than or
# equal to zero, with the Overall median victim age value which is 32.
# As seen in the outputs above, for this particular category the invalid age values
# are [0, -1, -2, -3, -4, -5, -6, -7, -8, -9]
df.loc[ df['Vict Age']<=0, 'Vict Age' ] = 32
# Display total number of records with invalid age values
# Age values zero and less are considered invalid age values
print()
print("Total number of records with invalid age values are: " + str( df[df['Vict Age']<=0]['Vict Age'].count() ))
# Display number of victims grouped by their Age
print()
print("Number of victims grouped by their Age are as below:")
df.groupby('Vict Age').size()
As seen in the output above, there are now NO records with invalid age values.
# Display missing values in Columns 'LAT' and 'LON'
print()
print("Missing values in the Column 'LAT' are: " + str(df['LAT'].isnull().sum()))
print("Missing values in the Column 'LON' are: " + str(df['LON'].isnull().sum()))
# Display records grouped by LAT values
print()
print("Records grouped by Latitude values:")
df.groupby('LAT').size()
# Display records grouped by LON values
print()
print("Records grouped by Longitude values:")
df.groupby('LON').size()
# Display mean LAT and LON values
print()
mean_LAT = df['LAT'].mean()
mean_LON = df['LON'].mean()
print( "Mean latitude value in the current data set is: " + str(mean_LAT) )
print( "Mean longitude value in the current data set is: " + str(mean_LON) )
Based on the outputs above we have the following observations for the 'LAT' and 'LON' columns:
According to Google Maps, the latitude of Los Angeles, CA, USA is 34.052235 and the longitude is -118.243683, which indicates that LAT and LON values of 0.0000 are invalid.
Based on further analysis of a random selection of records, we conclude that these invalid values could have been data entry errors.
Hence, we replace these invalid values with the respective column means.
# Replace invalid 'LAT' and 'LON' values with their mean values respectively
df.loc[ df['LAT']==0, 'LAT' ] = mean_LAT
df.loc[ df['LON']==0, 'LON' ] = mean_LON
# Display min and max values of 'LAT' and 'LON' columns
print()
print( "Min LAT is: " + str(df['LAT'].min()) + " and " + "Max LAT is: " + str(df['LAT'].max()) )
print( "Min LON is: " + str(df['LON'].min()) + " and " + "Max LON is: " + str(df['LON'].max()) )
As seen in the output above, there are now NO records with invalid LAT and LON values.
# Cosmetic correction
# When exploring the column name for AREA, a trailing space was found.
# Instead of having the column name as 'AREA' we have 'AREA '
# The code in this cell is to remove the trailing space
df = df.rename(columns = {"AREA ":"AREA"})
# Display records grouped by AREA and their corresponding AREA NAME
print()
print("Records grouped by 'AREA' and their corresponding 'AREA NAME':")
df.groupby(['AREA', 'AREA NAME']).size()
# Identify missing values in 'AREA' and 'AREA NAME' columns
print()
print( "Missing values in column 'AREA' are: " + str(df['AREA'].isnull().sum()) )
print( "Missing values in column 'AREA NAME' are: " + str(df['AREA NAME'].isnull().sum()) )
Based on the outputs above we have the following observations for 'AREA' and 'AREA NAME' columns:
Hence no modifications would be made to these columns.
# Identify missing values in 'Status' and 'Status Desc' columns
print()
print( "Missing values in column 'Status' are: " + str(df['Status'].isnull().sum()) )
print( "Missing values in column 'Status Desc' are: " + str(df['Status Desc'].isnull().sum()) )
# Display records grouped by 'Status' and their corresponding 'Status Desc'
print()
print("Records grouped by 'Status' and their corresponding 'Status Desc':")
df.groupby(['Status', 'Status Desc']).size()
Based on the outputs above we have the following observations for the 'Status' and 'Status Desc' columns:
Based on further analysis of the actual records, we found the following:
Hence, we concluded that these invalid values could have been data entry errors, and we replace them with the Status code 'IC' and Status Desc 'Invest Cont', which are the default values for these fields.
# Replace 'UNK' in 'Status Desc' column with 'Invest Cont'
df.loc[ df['Status Desc']=='UNK', 'Status Desc' ] = 'Invest Cont'
# Replace records with status codes '13', '19', 'TH' and 'CC' with 'IC'
df['Status'] = df['Status'].replace(['13', '19', 'TH', 'CC'], 'IC')
# Replace blanks in 'Status' column with 'IC'
df['Status'].fillna('IC', inplace = True)
# Identify missing values in 'Status' and 'Status Desc' columns
print()
print( "Missing values in column 'Status' are: " + str(df['Status'].isnull().sum()) )
print( "Missing values in column 'Status Desc' are: " + str(df['Status Desc'].isnull().sum()) )
# Display records grouped by 'Status' and their corresponding 'Status Desc'
print()
print("Records grouped by 'Status' and their corresponding 'Status Desc':")
df.groupby(['Status', 'Status Desc']).size()
As seen in the output above, there are now NO records with invalid 'Status' and 'Status Desc' values.
# Identify missing values in 'Crm Cd' and 'Crm Cd Desc' columns
print()
print( "Missing values in column 'Crm Cd' are: " + str(df['Crm Cd'].isnull().sum()) )
print( "Missing values in column 'Crm Cd Desc' are: " + str(df['Crm Cd Desc'].isnull().sum()) )
# Display records grouped by 'Crm Cd' and their corresponding 'Crm Cd Desc'
print()
print("Records grouped by 'Crm Cd' and their corresponding 'Crm Cd Desc':")
df.groupby(['Crm Cd', 'Crm Cd Desc']).size()
As seen above there are NO errors in 'Crm Cd' and 'Crm Cd Desc' columns.
# Identify missing values in 'Crm Cd 1', 'Crm Cd 2', 'Crm Cd 3', 'Crm Cd 4' columns
print()
print( "Missing values in column 'Crm Cd 1' are: " + str(df['Crm Cd 1'].isnull().sum()) )
print( "Missing values in column 'Crm Cd 2' are: " + str(df['Crm Cd 2'].isnull().sum()) )
print( "Missing values in column 'Crm Cd 3' are: " + str(df['Crm Cd 3'].isnull().sum()) )
print( "Missing values in column 'Crm Cd 4' are: " + str(df['Crm Cd 4'].isnull().sum()) )
Based on the data set description, we understand that Crime Code 1 is the primary and most serious offense, while Crime Codes 2, 3, and 4 are progressively less serious; lower crime class numbers indicate more serious crimes. The description also indicates that the value in the 'Crm Cd' column is the same as the value in the 'Crm Cd 1' column.
From our analysis above, we know that there are NO missing values in the 'Crm Cd' column. Hence, we analyze the records where 'Crm Cd' and 'Crm Cd 1' differ and decide whether to update them to match.
# Display the data type for columns 'Crm Cd 1', 'Crm Cd 2', 'Crm Cd 3', 'Crm Cd 4'
print()
print("Display the data type for columns 'Crm Cd 1', 'Crm Cd 2', 'Crm Cd 3', 'Crm Cd 4':")
print()
df[ ['Crm Cd 1', 'Crm Cd 2', 'Crm Cd 3', 'Crm Cd 4'] ].info()
# Since the codes are integer values, we want to change the data type to int
# Before changing the data type to int, we fill in -1 as the value for blank values
df['Crm Cd 1'].fillna(-1, inplace = True)
df['Crm Cd 2'].fillna(-1, inplace = True)
df['Crm Cd 3'].fillna(-1, inplace = True)
df['Crm Cd 4'].fillna(-1, inplace = True)
# Now convert the columns to datatype int
df[ ['Crm Cd 1', 'Crm Cd 2', 'Crm Cd 3', 'Crm Cd 4'] ] = df[ ['Crm Cd 1', 'Crm Cd 2', 'Crm Cd 3', 'Crm Cd 4'] ].astype(int)
# Display the data type for columns 'Crm Cd 1', 'Crm Cd 2', 'Crm Cd 3', 'Crm Cd 4'
print()
print("Display the data type for columns 'Crm Cd 1', 'Crm Cd 2', 'Crm Cd 3', 'Crm Cd 4':")
print()
df[ ['Crm Cd 1', 'Crm Cd 2', 'Crm Cd 3', 'Crm Cd 4'] ].info()
# Display the records where the values in 'Crm Cd' and 'Crm Cd 1' columns are different.
df[ (df['Crm Cd']) != (df['Crm Cd 1']) ][ ['Crm Cd', 'Crm Cd Desc', 'Crm Cd 1', 'Crm Cd 2', 'Crm Cd 3', 'Crm Cd 4'] ]
As seen in the output above, there are 3632 records where the value in 'Crm Cd' differs from 'Crm Cd 1'; in these records the 'Crm Cd' value appears in one of the other columns, such as 'Crm Cd 2'. We will assume that there is a valid reason for categorizing crime codes in this way. Hence, NO further modifications would be made to these columns.
# Identify missing values in 'Mocodes'
print()
print( "Missing values in column 'Mocodes' are: " + str(df['Mocodes'].isnull().sum()) )
# Display records with missing 'Mocodes'
df[ (df['Mocodes']).isnull() ][ ['Crm Cd', 'Crm Cd Desc', 'Mocodes'] ]
As seen in the output above, there are 228017 records with missing 'Mocodes'. We will assume that there is a valid reason for these records not having Mocodes. Hence, NO further modifications would be made to this column.
# Identify missing values in 'Weapon Used Cd' and 'Weapon Desc' columns
print()
print( "Missing values in column 'Weapon Used Cd' are: " + str(df['Weapon Used Cd'].isnull().sum()) )
print( "Missing values in column 'Weapon Desc' are: " + str(df['Weapon Desc'].isnull().sum()) )
# As seen in the output above, there is one record that does not have a 'Weapon Desc' but has a 'Weapon Used Cd'
# Display the one record that does not have a 'Weapon Desc' but has a 'Weapon Used Cd'
print()
print("The one record that does not have a 'Weapon Desc' but has a 'Weapon Used Cd' is:")
df[ (df['Weapon Desc'].isnull()) & (df['Weapon Used Cd'].notnull()) ][['Crm Cd', 'Crm Cd Desc', 'Weapon Used Cd', 'Weapon Desc']]
# Display records grouped by 'Weapon Used Cd' and their corresponding 'Weapon Desc'
print()
print("Records grouped by 'Weapon Used Cd' and their corresponding 'Weapon Desc':")
df.groupby(['Weapon Used Cd', 'Weapon Desc']).size()
Based on the outputs above we have the following observations for 'Weapon Used Cd' and 'Weapon Desc' columns:
A large number of records have blank weapon fields, meaning these 1404863 crimes did not involve any weapon.
There is one record that does not have a 'Weapon Desc' but has a 'Weapon Used Cd'. Looking closely at the record output it is clear that the crime was committed with a 'DEADLY WEAPON'. So we will conclude that the description of the weapon was not available. Since this is just one record, we will leave the record as is.
# Identify missing values in 'Premis Cd' and 'Premis Desc' columns
print()
print( "Missing values in column 'Premis Cd' are: " + str(df['Premis Cd'].isnull().sum()) )
print( "Missing values in column 'Premis Desc' are: " + str(df['Premis Desc'].isnull().sum()) )
# Display records grouped by 'Premis Cd' and their corresponding 'Premis Desc'
print()
print("Records grouped by 'Premis Cd' and their corresponding 'Premis Desc':")
df.groupby(['Premis Cd', 'Premis Desc']).size()
# Display the data type for column 'Premis Cd'
print()
print("Display the data type for column 'Premis Cd':")
print()
df[ ['Premis Cd'] ].info()
# Since the codes are integer values, we want to change the data type to int
# Before changing the data type to int, we fill in -1 as the value for blank values
df['Premis Cd'].fillna(-1, inplace = True)
# Now convert the columns to datatype int
df['Premis Cd'] = df['Premis Cd'].astype(int)
# Display the data type for column 'Premis Cd'
print()
print("Display the data type for column 'Premis Cd':")
print()
df[ ['Premis Cd'] ].info()
# Display missing values in 'Premis Cd' column
print()
print( "Missing values in column 'Premis Cd' are: " + str(df['Premis Cd'].isnull().sum()) )
# Identify missing values in 'LOCATION' and 'Cross Street' columns
print()
print( "Missing values in column 'LOCATION' are: " + str(df['LOCATION'].isnull().sum()) )
print( "Missing values in column 'Cross Street' are: " + str(df['Cross Street'].isnull().sum()) )
Based on the outputs above we have the following observations for the 'LOCATION' and 'Cross Street' columns:
Hence, we have concluded not to make any modifications to these columns.
# Identify missing values in 'DR_NO', 'Date Rptd', 'DATE OCC', 'TIME OCC', 'Rpt Dist No', 'Part 1-2' columns
print()
print( "Missing values in column 'DR_NO' are: " + str(df['DR_NO'].isnull().sum()) )
print( "Missing values in column 'Date Rptd' are: " + str(df['Date Rptd'].isnull().sum()) )
print( "Missing values in column 'DATE OCC' are: " + str(df['DATE OCC'].isnull().sum()) )
print( "Missing values in column 'TIME OCC' are: " + str(df['TIME OCC'].isnull().sum()) )
print( "Missing values in column 'Rpt Dist No' are: " + str(df['Rpt Dist No'].isnull().sum()) )
print( "Missing values in column 'Part 1-2' are: " + str(df['Part 1-2'].isnull().sum()) )
# Display min and max values from 'TIME OCC' column to verify whether the time values have any errors
print()
print("Minimum value in 'TIME' column: " + str(df['TIME OCC'].min()) )
print("Maximum value in 'TIME' column: " + str(df['TIME OCC'].max()) )
As seen above, there are NO errors in the 'TIME OCC' column: the time values fall within the range 0000 to 2359 (military time). There are also NO missing values in the columns 'DR_NO', 'Date Rptd', 'DATE OCC', 'TIME OCC', 'Rpt Dist No', and 'Part 1-2'.
Below is the summary of the observations when validating missing values in each of the columns in the data set:
# Identify duplicate rows in the entire data set, using information from all the columns
# The below code marks duplicates as 'True' except for the first occurrence.
# df.duplicated(subset=None, keep='first')
print()
print("Grouping non-duplicates into False bucket and duplicates into True bucket:")
df.groupby( [df.duplicated(subset=None, keep='first')] ).size()
As seen in the output above, there are NO duplicate rows in the current data set.
From the outputs of "df.info()" and "display(HTML(data_desc.to_html()))" given above, we conclude that there are NO duplicate columns in the current data set.
However, it is important to note that the information in columns "Crm Cd" and "Crm Cd 1" is basically the same. So when building models we would make use of only one of these columns.
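As a minimal sketch of that modeling choice (assuming we keep 'Crm Cd' as the canonical code; either column could serve), the redundant column would simply be dropped before modeling:
# Sketch: keep 'Crm Cd' and drop the near-duplicate 'Crm Cd 1' before modeling.
# The choice of which column to keep is ours, not mandated by the data set.
model_df = df.drop(columns=['Crm Cd 1'])
print(model_df.shape)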
When analyzing each individual column in the data set, the only outliers we encountered were two individuals aged 114 and 118. The same is graphically represented in the box plot below.
Please refer to section "Exploring Column: Vict Age" for more details.
# Box plot to display outliers in the 'Vict Age' column.
df.boxplot(column='Vict Age', return_type='axes');
# Summary of attributes in the dataframe
df.describe().round(2)
Many of our variables are categorical (even the integer based ones depicting codes). This makes means and medians a little deceptive because these categorical numbers are not ordinal. Many of the simple statistics for these categorical variables can be found in the Data Quality section during imputation.
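For these categorical code columns, frequency tables are a more faithful summary than means or medians. A quick sketch over a few of the code columns already in the dataframe:
# For categorical codes, report the mode and top frequencies instead of means.
for col in ['Crm Cd', 'Premis Cd', 'AREA']:
    print("Column:", col, "| mode:", df[col].mode()[0])
    print(df[col].value_counts().head(3))
    print()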
print("The mean time for a crime is around early afternoon at " + str(df['TIME OCC'].mean().round(0)))
print("The median time for a crime is a little later at " + str(df['TIME OCC'].median()))
print("The range of times contains the full clock cycle between " + str(df['TIME OCC'].min())+ " and " + str(df['TIME OCC'].max()))
We can see this comes from two separate peaks, one around the noon hour (1200) and one at 8 PM (2000). It also shows the continuation of late-night highs into early-morning crime occurrences between 12 AM (0) and 2 AM (200). We will be interested in digging into possible differences in the types of crimes that occur at these two peaks: one at mid-day, while people tend to be at work, and the other in the evening, around the time we expect it to be getting dark.
df.hist(column='TIME OCC')
plt.xlabel('Time Crime Occurred')
plt.title('Time Crime Occurred (Military Time)')
Victim age skews younger and is heavily concentrated between the mid 20s and mid 40s.
print("The mean victim age is " + str(df['Vict Age'].mean()))
print("The median victim age is " + str(df['Vict Age'].median()))
print("The range of victim ages is between " + str(df['Vict Age'].min())+ " and " + str(df['Vict Age'].max()))
This histogram of victims by age shows a right skew in the distribution, concentrated between the mid 20s and mid 40s. The peak is around the earlier-noted mean in the thirties. It is interesting that, despite the large range, there is a heavy skew toward the younger adult population.
df.hist(column='Vict Age')
plt.xlabel('Age')
plt.title('Victim Counts by Age')
Crime is reported in every LAPD Division area. However, two divisions pull ahead of the rest: Southwest and 77th Street. It will be interesting to see whether the types of crimes vary in a higher-crime district or if it is just volume.
df_area = df.groupby(by=['AREA NAME'])
area_crime_count = df_area['AREA NAME'].count()
area_crime_count.plot.barh(title= 'Crime by Area Reported')
One third of the crimes involved some weapon code as part of the crime committed.
totalSizeOfRows = df.shape[0]
totalWeaponsUsed = sum(df['Weapon Used Cd'] > 0)
percentageUsingWeapons = (totalWeaponsUsed / totalSizeOfRows) * 100
print(str(percentageUsingWeapons)+"% of the total crimes reported had a weapon reported as part of the crime")
print("The range of weapon codes is between " + str(df['Weapon Used Cd'].min())+ " and " + str(df['Weapon Used Cd'].max()))
We see the most weapon code usage in the 400 and 500 ranges.
df.hist(column='Weapon Used Cd')
plt.xlabel('Weapon Code')
plt.title('Weapon Counts by Code')
The vast majority of crimes in this data set are unsolved and sit in the status of continuing investigation (Invest Cont). Those crimes that do have a resolved status primarily involve adult arrests or other resolutions.
# Count the number of unsolved cases, i.e. those with status 'IC' (Invest Cont)
totalUnsolved = sum(df['Status'] == "IC")
percentageUnsolved = (totalUnsolved / totalSizeOfRows) * 100
print(str(percentageUnsolved)+"% of the total crimes reported are currently unsolved")
df_area = df.groupby(by=['Status Desc'])
area_crime_count = df_area['Status Desc'].count()
area_crime_count.plot.barh(title= 'Status Counts by Description')
# Extracting Year of Crime as an attribute
df['year'] = pd.DatetimeIndex(df['DATE OCC']).year
# Extracting Month of Crime as an attribute
df['month'] = pd.DatetimeIndex(df['DATE OCC']).month_name()
# Extracting Hour of Day as an attribute
df['DayHr'] = pd.cut(df['TIME OCC'],[0,59,159,259,359,459,559,659,759,859,959,1059,1159,1259,1359,1459,1559,1659,1759,1859,1959,2059,2159,2259,2359],labels=['1','2','3','4','5','6','7','8','9','10','11','12','13','14','15','16','17','18','19','20','21','22','23','24'],include_lowest=True)
# Extracting Age Group as an attribute
df['AgeGroup'] = pd.cut(df['Vict Age'],[1,12,19,35,65,120],labels=['PreTeen','Teen','YoungAdult','Adult','Senior'],include_lowest=True)
# Reference code : DataMiningNotebooks-master/02. DataVisualization.ipynb
plt.style.use('ggplot')
df_area = df.groupby(by=['AREA NAME'])
area_crime_count = df_area['AREA NAME'].count()
area_crime_count.sort_values().plot.barh(title= 'Crime Reported Over Last 10 Years')
# Reference code : https://pandas-docs.github.io/pandas-docs-travis/user_guide/groupby.html
df_dayhr_tmp = df.groupby('DayHr').count().reset_index().iloc[:,[0,1]]
df_dayhr = df_dayhr_tmp.rename(columns={df_dayhr_tmp.columns[0]: "HrOfDay", df_dayhr_tmp.columns[1]: "Count"})
df_dayhr
plt.hlines(df_dayhr.Count.mean(),0,24,color="red")
plt.text(4,105000,"Mean Hourly \nCrime Reports",color ="red", horizontalalignment = "center", verticalalignment = "center")
plt.bar(df_dayhr.HrOfDay,df_dayhr.Count,color= "blue", alpha = 0.6)
plt.title('Crime Count by the hour of Day')
plt.xlabel('Hour of Day')
plt.ylabel('Total number of crimes reported')
# Reference code
# 1. DataMiningNotebooks-master/02. DataVisualization.ipynb
# 2. Reference code : https://pandas-docs.github.io/pandas-docs-travis/user_guide/groupby.html
df['DayType'] = pd.cut(df['TIME OCC'],[0,359,759,1159,1559,1959,2359],labels=['00hrs - 04hrs','04hrs - 08hrs','08hrs - 12hrs','12hrs - 16hrs','16hrs - 20hrs','20hrs - 00hrs'],include_lowest=True)
df_daytime_tmp = df.groupby('DayType').count().reset_index().iloc[:,[0,1]]
df_daytime = df_daytime_tmp.rename(columns={df_daytime_tmp.columns[0]: "TimeOfDay", df_daytime_tmp.columns[1]: "Count"})
ax1 = plt.pie(df_daytime.Count,labels=df_daytime.TimeOfDay,autopct='%1.1f%%')
plt.title('% Crime Reports in 4hrs window')
df_crimecode_grp = df.groupby(by=['Crm Cd','Crm Cd Desc']).count().reset_index().iloc[:,[0,1,2]]
df_crimecode = df_crimecode_grp.rename(columns={df_crimecode_grp.columns[0]: "Crime Code", df_crimecode_grp.columns[1]: "Description",df_crimecode_grp.columns[2]: "Count"})
df_crimecode['%']=100*df_crimecode.Count/df_crimecode.Count.sum()
df_crimecode.sort_values(['%'],inplace=True,ascending=False)
df_crimecode['Cum %']=df_crimecode['%'].cumsum()
#plt.bar(df_crimecode.head(10)['Crime Code'].apply(str),df_crimecode.head(10)['%'],align='center')
plt.barh(df_crimecode.head(10)['Description'],df_crimecode.head(10)['%'],align='center')
plt.xlabel('% Crime Reports')
plt.title('% Top 10 Crime Types')
plt.gca().invert_yaxis()
df_premiscode_grp = df.groupby(by=['Premis Cd','Premis Desc']).count().reset_index().iloc[:,[0,1,2]]
df_premiscode = df_premiscode_grp.rename(columns={df_premiscode_grp.columns[0]: "Premis Code", df_premiscode_grp.columns[1]: "Description",df_premiscode_grp.columns[2]: "Count"})
df_premiscode['%']=100*df_premiscode.Count/df_premiscode.Count.sum()
df_premiscode.sort_values(['%'],inplace=True,ascending=False)
df_premiscode['Cum %']=df_premiscode['%'].cumsum()
#plt.bar(df_crimecode.head(10)['Crime Code'].apply(str),df_crimecode.head(10)['%'],align='center')
plt.barh(df_premiscode.head(10)['Description'],df_premiscode.head(10)['%'],align='center')
plt.gca().invert_yaxis()
plt.xlabel('% Crime Reports')
plt.title('% Top 10 Crime Premises')
# Reference code
# 1. Reference code : https://pandas-docs.github.io/pandas-docs-travis/user_guide/groupby.html
df_weaponcode_grp = df.groupby(by=['Weapon Used Cd','Weapon Desc']).count().reset_index().iloc[:,[0,1,2]]
df_weaponcode = df_weaponcode_grp.rename(columns={df_weaponcode_grp.columns[0]: "Weapon Code", df_weaponcode_grp.columns[1]: "Description",df_weaponcode_grp.columns[2]: "Count"})
df_weaponcode['%']=100*df_weaponcode.Count/df_weaponcode.Count.sum()
df_weaponcode.sort_values(['%'],inplace=True,ascending=False)
df_weaponcode['Cum %']=df_weaponcode['%'].cumsum()
plt.barh(df_weaponcode.head(10)['Description'],df_weaponcode.head(10)['%'],align='center')
plt.gca().invert_yaxis()
plt.xlabel('% Crime Reports')
plt.title('% Top 10 Weapons used in Crime')
Looking at the overall boundaries, it is apparent that crime occurs across groups, concentrated within the tighter city limits toward the bottom left.
# Reference code
# 1. Reference code : https://pandas-docs.github.io/pandas-docs-travis/user_guide/groupby.html
# 2. Reference code : https://plotly.com/python/treemaps/
df_ad_grp = df.groupby(['AREA NAME', 'Vict Descent']).count().reset_index().iloc[:,[0,1,2]]
df_ad = df_ad_grp.rename(columns={df_ad_grp.columns[0]: "Area", df_ad_grp.columns[1]: "Descent", df_ad_grp.columns[2]: "Count"})
fig = px.treemap(df_ad, path=['Area', 'Descent'],values='Count')
fig.update_layout(title="Crime Area & Victim Descent Treemap",width=800, height=500,title_x=0.5)
fig.show()
The data should ideally be reviewed in the context of actual demographics in each area, to see whether a particular community tends to be targeted disproportionately to its share of the area's population. Such demographic data is not part of the current work and could be added in the future for additional inference.
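As a sketch of what that future normalization could look like, using made-up placeholder population figures (NOT real demographic data) purely for illustration:
# Hypothetical sketch: normalize crime counts by area population.
# The population numbers below are placeholders, NOT real demographics.
pop = pd.DataFrame({'AREA NAME': ['Central', '77th Street'],
                    'Population': [40000, 175000]})  # placeholder values
crimes = df.groupby('AREA NAME').size().rename('Crimes').reset_index()
rates = crimes.merge(pop, on='AREA NAME')  # inner join keeps only areas with population data
rates['Per 1,000 residents'] = 1000 * rates['Crimes'] / rates['Population']
print(rates)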
# Reference code
# 1. Reference code : https://pandas-docs.github.io/pandas-docs-travis/user_guide/groupby.html
df_area_yr_grp = df.groupby(by=['AREA NAME','year']).count().reset_index().iloc[:,[0,1,2]]
df_area_yr = df_area_yr_grp.rename(columns={df_area_yr_grp.columns[0]: "Area", df_area_yr_grp.columns[1]: "Year",df_area_yr_grp.columns[2]: "Count"})
df_area_yr_pivot = df_area_yr.pivot(index='Area', columns='Year', values='Count').transpose()
areas = ['77th Street', 'Southwest', 'N Hollywood', 'Pacific', 'Southeast', 'Mission',
         'Northeast', 'Newton', 'Van Nuys', 'Hollywood', 'Central', 'Topanga',
         'Devonshire', 'Olympic', 'Harbor', 'Rampart', 'West Valley', 'West LA',
         'Wilshire', 'Foothill', 'Hollenbeck']
# Highlight Pacific and Central with heavier lines (Central also in red)
special = {'Pacific': {'linewidth': 3}, 'Central': {'linewidth': 3, 'color': 'red'}}
for area in areas:
    plt.plot(df_area_yr_pivot.index.values, df_area_yr_pivot[area],
             marker='o', label=area, **special.get(area, {}))
plt.ylim(5000,20000)
plt.legend(loc='lower center',bbox_to_anchor=(1.2, -0.3))
plt.xlabel('Year')
plt.ylabel('Number of Crime Reports')
plt.title('LA Crime Over 10 Years \n (Different Policing Areas)')
In the visualizations below, additional attributes are investigated for the Central Area to explore what is driving the increase in crime.
# Reference code
# 1. Reference code : https://pandas-docs.github.io/pandas-docs-travis/user_guide/groupby.html
# 2. Reference code : https://stackoverflow.com/questions/26355313/plotting-multiple-plots-generated-inside-a-for-loop-on-the-same-axes-python
df_cent_crcd_yr_grp = df.loc[df['AREA NAME'] == 'Central'].groupby(by=['Crm Cd','Crm Cd Desc', 'year']).count().reset_index().iloc[:,[0,1,2,3]]
df_sex_yr_grp = df.groupby(by=['Vict Sex','year']).count().reset_index().iloc[:,[0,1,2]]
df_cent_crcd_yr = df_cent_crcd_yr_grp.rename(columns={df_cent_crcd_yr_grp.columns[0]: "CrimeCode", df_cent_crcd_yr_grp.columns[1]: "Description",df_cent_crcd_yr_grp.columns[2]: "Year",df_cent_crcd_yr_grp.columns[3]: "Count"})
df_cent_crcd_yr_pivot = df_cent_crcd_yr.pivot(index='Description', columns='Year', values='Count').transpose()
for (columnName, columnData) in df_cent_crcd_yr_pivot.iteritems():
    # Ratio of 2019 (index 9) to 2014 (index 4) counts for each crime type
    incThresh = float(columnData.values[9])/float(columnData.values[4])
    if (incThresh > 10):
        plt.plot( df_cent_crcd_yr_pivot.index.values,columnData.values/columnData.values[4],marker='8',label= [str(columnName)],linewidth=4,color='purple')
    if (incThresh > 5) and (incThresh <= 10):
        plt.plot( df_cent_crcd_yr_pivot.index.values,columnData.values/columnData.values[4],marker='s',label= [str(columnName)],linewidth=3,color='red')
    if (incThresh > 4) and (incThresh <= 5):
        plt.plot( df_cent_crcd_yr_pivot.index.values,columnData.values/columnData.values[4],marker='o',label= [str(columnName)],linewidth=2,linestyle='--',color='blue')
    if (incThresh > 3) and (incThresh <= 4):
        plt.plot( df_cent_crcd_yr_pivot.index.values,columnData.values/columnData.values[4],marker='*',label= [str(columnName)],linewidth=1,linestyle='-.',color='black')
    if (incThresh > 2) and (incThresh <= 3):
        plt.plot( df_cent_crcd_yr_pivot.index.values,columnData.values/columnData.values[4],marker='d',label= [str(columnName)],linewidth=0.5,linestyle=':',color='grey')
plt.ylim(0.1,100)
plt.yscale('log')
plt.legend(loc='right',bbox_to_anchor=(2.2,0.5))
plt.xlabel('Year')
plt.ylabel('Crime reports normalized to yr 2014')
plt.title('Increasing Crime Types in Central Area')
# Reference code : https://stackoverflow.com/questions/13091649/unique-plot-marker-for-each-plot-in-matplotlib/41035147
# Reference code : https://stackoverflow.com/questions/5237611/itertools-cycle-next
colorv=cm.rainbow(np.linspace(0,1,19))
icolor = 1
marker = itertools.cycle((',', '+', '.', 'o', '*'))
df_cent_des_yr_grp = df.loc[df['AREA NAME'] == 'Central'].groupby(by=['Vict Descent','year']).count().reset_index().iloc[:,[0,1,2]]
df_cent_des_yr = df_cent_des_yr_grp.rename(columns={df_cent_des_yr_grp.columns[0]: "VictDesc", df_cent_des_yr_grp.columns[1]: "Year",df_cent_des_yr_grp.columns[2]: "Count"})
df_cent_des_yr_pivot = df_cent_des_yr.pivot(index='VictDesc', columns='Year', values='Count').transpose()
for (columnName, columnData) in df_cent_des_yr_pivot.iteritems():
    plt.plot( df_cent_des_yr_pivot.index.values,columnData.values/columnData.values[4],marker=next(marker),label= [str(columnName)],linewidth=2)
    icolor = icolor+1
plt.ylim(0.1,100)
plt.yscale('log')
plt.legend(loc='right',bbox_to_anchor=(1.2,0.5))
plt.xlabel('Year')
plt.ylabel('Crime reports normalized to yr 2014')
plt.title('Increasing Crime in Central Area \n by Victim Descent')
# Reference code
# 1. Reference code : https://pandas-docs.github.io/pandas-docs-travis/user_guide/groupby.html
# 2. Reference code : DataMiningNotebooks-master/02. DataVisualization.ipynb
df_yr_month_grp = df.groupby(['year', 'month']).count().reset_index().iloc[:,[0,1,2]]
df_yr_month = df_yr_month_grp.rename(columns={df_yr_month_grp.columns[0]: "Year", df_yr_month_grp.columns[1]: "Month", df_yr_month_grp.columns[2]: "Count"})
ax = sns.heatmap(df_yr_month.pivot(index='Month', columns='Year', values='Count'), linewidths=.5)
ax.set_title('LA Crime Heat Map \n Month x Year')
This was largely developed using a GeoPandas tutorial found on Towards Data Science, with shape data of the city boundaries from the LA City GeoHub. These shape files consist of many points and provide the background of our map, allowing us to overlay the geolocational data on a map of the LA city boundaries.
#create a geopandas dataframe and convert lat/long to point geometry:
geometry = [Point(xy) for xy in zip(df["LON"], df["LAT"])]
geometry[:3]
#Give the Geodataframe our geometry created by lat/long
geo_df = gpd.GeoDataFrame(df, geometry = geometry)
This shape file from the LA City GeoHub allows us to create a map of LA on which to plot the longitude/latitude coordinates.
#Read in the Shape file to create a general outline of the LA City Boundaries
la_map = gpd.read_file('Data/shape/City_Boundaries.shp')
#Finally we are able to plot the data on the map
fig,ax = plt.subplots(figsize = (15,15))
la_map.plot(ax= ax, alpha = 0.4, color = "grey")
#starting with the largest group and decreasing to limit as much overlay as possible
geo_df[geo_df['Vict Descent']=='B'].plot(ax = ax, markersize = 10, marker = 'o', label = 'Black')
geo_df[geo_df['Vict Descent']=='W'].plot(ax = ax, markersize = 10, marker = 'o', label = 'White')
geo_df[geo_df['Vict Descent']=='H'].plot(ax = ax, markersize = 10, marker = 'o', label = 'Hispanic/Latin/Mexican')
geo_df[geo_df['Vict Descent']=='C'].plot(ax = ax, markersize = 10, marker = 'o', label = 'Chinese')
geo_df[geo_df['Vict Descent']=='V'].plot(ax = ax, markersize = 10, marker = 'o', label = 'Vietnamese')
geo_df[geo_df['Vict Descent']=='F'].plot(ax = ax, markersize = 10, marker = 'o', label = 'Filipino')
geo_df[(geo_df['Vict Descent']=='A')].plot(ax = ax, markersize = 10, marker = 'o', label = 'Other Asian')
geo_df[geo_df['Vict Descent']=='D'].plot(ax = ax, markersize = 10, marker = 'o', label = 'Cambodian')
geo_df[geo_df['Vict Descent']=='Z'].plot(ax = ax, markersize = 10, marker = 'o', label = 'Asian Indian')
geo_df[geo_df['Vict Descent']=='G'].plot(ax = ax, markersize = 10, marker = 'o', label = 'Guamanian')
geo_df[geo_df['Vict Descent']=='I'].plot(ax = ax, markersize = 10, marker = 'o', label = 'American Indian/Alaskan Native')
geo_df[geo_df['Vict Descent']=='J'].plot(ax = ax, markersize = 10, marker = 'o', label = 'Japanese')
geo_df[geo_df['Vict Descent']=='K'].plot(ax = ax, markersize = 10, marker = 'o', label = 'Korean')
geo_df[geo_df['Vict Descent']=='L'].plot(ax = ax, markersize = 10, marker = 'o', label = 'Laotian')
geo_df[geo_df['Vict Descent']=='P'].plot(ax = ax, markersize = 10, marker = 'o', label = 'Pacific Islander')
geo_df[geo_df['Vict Descent']=='S'].plot(ax = ax, markersize = 10, marker = 'o', label = 'Samoan')
geo_df[geo_df['Vict Descent']=='U'].plot(ax = ax, markersize = 10, marker = 'o', label = 'Hawaiian')
plt.title('LA Crime By Descent of the Victim \n (Unknowns and Other excluded)')
plt.legend(prop={'size':15})
# Descent of the Victim (Code): A - Other Asian | B - Black | C - Chinese | D - Cambodian
# | F - Filipino | G - Guamanian | H - Hispanic/Latin/Mexican | I - American Indian/Alaskan Native
# | J - Japanese | K - Korean | L - Laotian | O - Other | P - Pacific Islander | S - Samoan
# | U - Hawaiian | V - Vietnamese | W - White | X - Unknown | Z - Asian Indian
There is some grouping of areas in the victim descent vs. latitude/longitude plot. The largest clustered areas correspond to the LAPD's actual coverage area; however, there are reported crimes outside of these LAPD boundaries. The most clustering can be seen among groups with smaller representation in this crime data set, such as many of the specific Asian populations: Korean, Vietnamese, Cambodian, etc.
Once we are able to link the areas to one of the data sets that connect a reporting district or area to an LAPD Bureau, we can look closer at these four sections of LA, as sketched below. Once broken into smaller maps, it may also be better to compare larger populations on their own maps.
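A sketch of that linkage, assuming a hypothetical lookup file (Data/Reporting_District_to_Bureau.csv with columns 'Rpt Dist No' and 'Bureau' is our assumption, not an existing project file); it is left commented out until such a file is obtained:
# Hypothetical sketch: attach an LAPD Bureau to each record via its reporting district.
# 'Data/Reporting_District_to_Bureau.csv' and its column names are assumptions.
# rd_to_bureau = pd.read_csv('Data/Reporting_District_to_Bureau.csv')
# geo_df = geo_df.merge(rd_to_bureau[['Rpt Dist No', 'Bureau']], on='Rpt Dist No', how='left')
# for bureau, sub in geo_df.groupby('Bureau'):
#     print(bureau, len(sub))  # record counts per bureau before drawing per-bureau maps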
Below is an example of the 4 LAPD Bureaus, which make up the majority of the mapped crime in the greater LA city area.
#Read in the Shape file to create a general outline of the LAPD Bureaus
lapd = gpd.read_file('Data/shape/LAPD_Bureaus.shp')
#Finally we are able to plot the data on the map
fig,ax = plt.subplots(figsize = (15,15))
lapd.plot(ax= ax)
# Reference code
# 1. Reference code : https://pandas-docs.github.io/pandas-docs-travis/user_guide/groupby.html
df_sex_yr_grp = df.groupby(by=['Vict Sex','year']).count().reset_index().iloc[:,[0,1,2]]
df_sex_yr = df_sex_yr_grp.rename(columns={df_sex_yr_grp.columns[0]: "Sex", df_sex_yr_grp.columns[1]: "Year",df_sex_yr_grp.columns[2]: "Count"})
df_sex_yr_pivot = df_sex_yr.pivot(index='Sex', columns='Year', values='Count').transpose()
plt.plot( df_sex_yr_pivot.index.values,df_sex_yr_pivot['M'],marker='o',label= 'M',linewidth=2,color="blue")
plt.plot( df_sex_yr_pivot.index.values,df_sex_yr_pivot['F'],marker='o',label= 'F',linewidth=2,color="red")
plt.ylim(50000,120000)
plt.legend(loc='lower center',bbox_to_anchor=(1.2,0.3))
plt.xlabel('Year')
plt.ylabel('Number of Crime Reports')
plt.title('LA Crime Over 10 Years \n (Males Vs Females)')
# Reference code
# 1. Reference code : https://pandas-docs.github.io/pandas-docs-travis/user_guide/groupby.html
df_sex_hr_grp = df.groupby(by=['Vict Sex','DayHr']).count().reset_index().iloc[:,[0,1,2]]
df_sex_hr = df_sex_hr_grp.rename(columns={df_sex_hr_grp.columns[0]: "Sex", df_sex_hr_grp.columns[1]: "DayHr",df_sex_hr_grp.columns[2]: "Count"})
df_sex_hr_pivot = df_sex_hr.pivot(index='Sex', columns='DayHr', values='Count').transpose()
#df_sex_hr_pivot
plt.plot( np.array(df_sex_hr_pivot.index.values),df_sex_hr_pivot['M'],marker='o',label= 'M',linewidth=2,color="blue")
plt.plot( np.array(df_sex_hr_pivot.index.values),df_sex_hr_pivot['F'],marker='o',label= 'F',linewidth=2,color="red")
#plt.ylim(50000,120000)
plt.legend(loc='lower center',bbox_to_anchor=(1.2,0.3))
plt.xlabel('Hr of Day')
plt.ylabel('Number of Crime Reports')
plt.title('Crime Reports over Hour of Day \n (Males Vs Females)')
# Reference code
# 1. Reference code : https://pandas-docs.github.io/pandas-docs-travis/user_guide/groupby.html
# 2. Reference code : https://plotly.com/python/treemaps/
df_as_grp = df.groupby(['AREA NAME', 'Vict Sex']).count().reset_index().iloc[:,[0,1,2]]
df_as = df_as_grp.rename(columns={df_as_grp.columns[0]: "Area", df_as_grp.columns[1]: "Victim Sex", df_as_grp.columns[2]: "Count"})
fig = px.treemap(df_as, path=['Area', 'Victim Sex'],values='Count')
fig.update_layout(title="Crime Area & Victim Sex Treemap",width=800, height=500,title_x=0.5)
fig.show()
# Reference code
# 1. Reference code : https://pandas-docs.github.io/pandas-docs-travis/user_guide/groupby.html
# 2. Reference code : https://plotly.com/python/treemaps/
df_aa_grp = df.groupby(['AREA NAME', 'AgeGroup']).count().reset_index().iloc[:,[0,1,2]]
df_aa = df_aa_grp.rename(columns={df_aa_grp.columns[0]: "Area", df_aa_grp.columns[1]: "Victim Age Group", df_aa_grp.columns[2]: "Count"})
fig = px.treemap(df_aa, path=['Area', 'Victim Age Group'],values='Count')
fig.update_layout(title="Crime Area & Victim Age Group Treemap",width=800, height=500,title_x=0.5)
fig.show()
Adult and Young Adult dominate the top two crime age groups in all areas. The third spot is taken by either teens or senior citizens, as shown below.
# Reference code : https://seaborn.pydata.org/examples/grouped_boxplot.html
sns.boxplot(x="DayHr", y="Vict Age",data=df)
plt.title('Victim Age distribution vs hour of day')
plt.xlabel('Hour of the Day')
#Finally we are able to plot the data on the map
fig,ax = plt.subplots(figsize = (15,15))
la_map.plot(ax= ax, alpha = 0.4, color = "grey")
#starting with the largest group and decreasing to limit as much overlay as possible
geo_df[(geo_df['Vict Age']>=18) & (geo_df['Vict Age']<=40)].plot(ax = ax, markersize = 10, color = "red", marker = 'o', label = 'Victim Aged 18 - 40')
geo_df[geo_df['Vict Age']>40].plot(ax = ax, markersize = 10, color = "orange", marker = 'o', label = 'Victim Over 40')
geo_df[geo_df['Vict Age']<18].plot(ax = ax, markersize = 10, color = "blue", marker = 'o', label = 'Victim Under 18')
plt.title('LA Crime By General Age groups \n (Under 18, 18 to 40, and Over 40)')
plt.legend(prop={'size':15})
When we are able to map smaller bureau areas, it may be useful to overlay again and check for clusterings at a closer level.
#Same plot, but looking to see if there are Sex specific areas that appear
fig,ax = plt.subplots(figsize = (15,15))
la_map.plot(ax= ax, alpha = 0.4, color = "grey")
geo_df[(geo_df['Vict Sex']=='M')].plot(ax = ax, markersize = 10, color = "blue", marker = 'o', label = 'Male')
geo_df[geo_df['Vict Sex']=='F'].plot(ax = ax, markersize = 10, color = "red", marker = 'o', label = 'Female')
plt.title('LA Crime by Victim Sex \n (Males Vs Females)')
plt.legend(prop={'size':15})
In the future we would like to derive more features from the reported and occurrence dates. By subtracting them we can see how long it took to report the crime. We could also denote crime occurrences by the day of the week on which they occurred and whether or not it was a weekend.
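A minimal sketch of those date-derived features, assuming 'Date Rptd' and 'DATE OCC' parse with pandas' default date parser:
# Sketch: reporting delay and day-of-week features derived from the two date fields.
date_rptd = pd.to_datetime(df['Date Rptd'])
date_occ = pd.to_datetime(df['DATE OCC'])
df['Days to Report'] = (date_rptd - date_occ).dt.days
df['Day of Week'] = date_occ.dt.day_name()
df['Weekend'] = date_occ.dt.dayofweek >= 5  # Saturday=5, Sunday=6
df[['Days to Report', 'Day of Week', 'Weekend']].head()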
We could also use these dates in conjunction with APIs to determine sunset time and get an idea of whether it was day or night at the date and time of occurrence. There is an open Sunrise-Sunset API that takes a latitude, longitude, and date and returns precise sunrise and sunset times for that geolocation. These would then be compared to Time Occurred to flag the crime as Before Sunrise, Daytime, or After Sunset. This data would help account for our intuition that there are times after dark when crime has a second spike.
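A sketch of a single lookup against the open Sunrise-Sunset API (endpoint and parameters per https://sunrise-sunset.org/api, to the best of our knowledge); a real pipeline would need caching and rate limiting rather than one request per record, plus a UTC-to-local conversion before comparing against 'TIME OCC':
# Sketch: sunrise/sunset lookup for one record's location and date.
import requests
row = df.iloc[0]
resp = requests.get('https://api.sunrise-sunset.org/json',
                    params={'lat': row['LAT'], 'lng': row['LON'],
                            'date': pd.to_datetime(row['DATE OCC']).strftime('%Y-%m-%d'),
                            'formatted': 0})  # formatted=0 returns ISO 8601 UTC times
results = resp.json()['results']
print(results['sunrise'], results['sunset'])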
Another API could be used to see whether the crime was committed on a holiday. To get the holidays most commonly given off, we could use Calendarific, as it gives the option to filter by type of holiday and limit to national holidays. For each year we would get the list of holidays marked as 'National'; then we would classify each crime as happening on a holiday or not by checking whether the Date Occurred is in the list.
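A sketch of that holiday flag using a Calendarific-style request; the endpoint, parameters, and response fields below are assumptions to verify against their documentation, and an API key is required, so the block is left commented out:
# Hypothetical sketch: flag crimes that occurred on a US national holiday.
# import requests
# holidays = set()
# for year in range(2010, 2020):
#     resp = requests.get('https://calendarific.com/api/v2/holidays',
#                         params={'api_key': 'YOUR_KEY', 'country': 'US',
#                                 'year': year, 'type': 'national'})
#     for h in resp.json()['response']['holidays']:
#         holidays.add(h['date']['iso'][:10])  # 'YYYY-MM-DD'
# df['On Holiday'] = pd.to_datetime(df['DATE OCC']).dt.strftime('%Y-%m-%d').isin(holidays)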
Weapons not mentioned as present are represented as null in the Weapon Used Code field; this could become a single Weapon Used flag that resolves to true or false. Similarly, checking the Crime Code fields 1-4 could yield a charge count to be stored in a crime count field.
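A sketch of both derived fields; note that 'Crm Cd 2' through 'Crm Cd 4' were imputed to -1 for missing earlier in this notebook, so we count codes greater than zero rather than checking for nulls:
# Sketch: boolean weapon flag plus a charge count from the four crime-code fields.
df['Weapon Used'] = df['Weapon Used Cd'].notnull()
crm_cols = ['Crm Cd 1', 'Crm Cd 2', 'Crm Cd 3', 'Crm Cd 4']
df['Charge Count'] = (df[crm_cols] > 0).sum(axis=1)  # -1 marks a missing code
df[['Weapon Used', 'Charge Count']].head()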
Making use of the MO Codes (Modus Operandi codes) will take some processing to split them into lists and aggregate them across the data set. The most frequently used codes could then be tallied and dummy-coded as new columns that an incident either has or does not have, rather than adding potentially hundreds of new fields to each row.
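A sketch of that tally-and-dummy-code approach; the top-20 cutoff is an arbitrary illustrative choice:
# Sketch: tally MO codes across the data set, then dummy-code only the most frequent.
mo_counts = df['Mocodes'].dropna().str.split().explode().value_counts()
top_mo = mo_counts.head(20).index  # arbitrary cutoff for illustration
for code in top_mo:
    # \b word boundaries keep one code from matching inside another
    df['MO_' + code] = df['Mocodes'].fillna('').str.contains(r'\b' + code + r'\b')
mo_counts.head(10)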
The precincts are given and can be joined to other precinct data sets, such as the LAPD Reporting Districts, which include the total area of each precinct. With some aggregation of that LAPD Reporting District information, we could find how many cars (the BASICCAR field counted across a precinct) are assigned to each precinct under the LAPD Basic Car Plan. This would give us an idea of the number of police available to patrol an area and how much territory they are responsible for patrolling.
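A sketch of that aggregation, assuming the LAPD Reporting Districts layer has been downloaded; the file name and the 'REPDIST' and 'BASICCAR' column names are assumptions to check against the actual download, so the block is left commented out:
# Hypothetical sketch: basic cars per reporting district joined onto the crimes.
# districts = pd.read_csv('Data/LAPD_Reporting_Districts.csv')
# cars = districts.groupby('REPDIST')['BASICCAR'].nunique().rename('Basic Cars')
# df = df.merge(cars, left_on='Rpt Dist No', right_index=True, how='left')
# df['Basic Cars'].describe()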
Below is work that goes above and beyond the requirements mentioned for Lab-1, which we want to claim as 'Exceptional Work':
# Creating dataframe with the required variables for PCA
df_pca = df[['TIME OCC', 'AREA', 'Rpt Dist No', 'LAT', 'LON', 'Crm Cd', 'Crm Cd 1', 'Crm Cd 2', 'Crm Cd 3', 'Crm Cd 4', 'Part 1-2']]
# PCA dataframe verification
df_pca.info()
df_pca.isnull().sum()
# PCA: Implement dimensionality reduction, then visualize and interpret the results.
# Adapted from an example by Dr. Jake Drew: https://nbviewer.jupyter.org/github/jakemdrew/MachineLearningExtras/blob/master/LFW%20Dataset%20and%20Class%20Imbalance.ipynb
def plot_pca(X):
    # Perform PCA on the data to reduce the number of initial features
    # and to remove correlations that are common between features
    pca = PCA(n_components='mle')
    pca.fit(X)
    # Inspect the explained variances to determine how many components to use
    plt.subplots(figsize=(8, 8))
    plt.plot(np.cumsum(pca.explained_variance_ratio_))
    plt.xlabel('number of components')
    plt.ylabel('cumulative explained variance')
    print('Cumulative explained variance at 3 components:', sum(pca.explained_variance_ratio_[0:3]))
plot_pca(df_pca)
Note: the columns used for PCA are those selected into df_pca above.
As seen above, the optimal number of components for this PCA is 3; the number of components was determined using MLE (Maximum Likelihood Estimation). After 3 principal components are added, the cumulative explained variance reaches 93.7% and does not increase significantly with subsequent components.